- Word Check Module (Formerly SpellChk)
- ==================
-
- RiscOS 3 version 0.03 © Geoff. Lane. Mar 1992
- Internet: zzassgl@uts.mcc.ac.uk
- Janet : zzassgl@uk.ac.mcc.uts
-
- (The information given here may not exactly match the current state of
- the module.)
-
-
- Introduction
- ------------
-
- This implements a word spelling check module for British/American
- technical English (with a selection of Arc specific words added for
- good luck) or other (unsupplied) languages. It is based on a
- description in Chapter 13 of "Programming Pearls" by Jon Bentley (ISBN
- 0-201-10331-1) of a clever algorithm devised by Doug McIlroy in 1978.
-
- Many RiscOS based word processors contain a spelling checker; each has
- its own word list and interface. These checkers are impossible to use
- outside the application. For each application there will be one or
- more large dictionary files, each unique to the application. If a
- standard interface were created using SWIs it would be possible for one
- module to provide a word check function to all applications that
- required such a facility. If an application could rely on a Word
- check module being available then the application would be smaller.
- The facilities provided within this module are a first attempt to
- define such an interface.
-
- This program was created and tested on a 2M A5000 machine running
- RiscOS 3. The compiler used was Norcroft C version 3.0. The shared
- library version used was 3.87. (I haven't included any RMEnsure
- commands in the !Boot or !Run files - I'm not aware of any version
- dependencies.) If you don't have the C library in ROM then you should
- edit the files to check for and load the shared library.
-
-
-
- Algorithm
- ---------
-
- The algorithm allows a dictionary to be highly compressed by encoding
- each word as a unique 32 bit number. The resulting list of numbers is
- sorted and then a table of differences is created. This table of 16
- bit numbers is included in the module. To find a word in the table
- you encode the word and then see if it is possible to generate the
- same value by summing the contents of the table - if you get a match
- then the word is valid. (Of course you have to choose the encoding
- algorithm quite carefully to ensure that the vast majority of words
- translate to unique numbers (in this case only 19 words hash to the
- same value) and that the difference between each pair of adjacent
- sorted numbers is always less than 32K.) Thus a module of about 71K
- bytes can be used to check the spelling of about 32000 different words.
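-
- The table-of-differences idea can be sketched in a few lines of
- Python. This is only an illustration: the encode() function below uses
- CRC-32 as a stand-in because the module's real hash is not described
- here, and the sketch does not enforce the requirement that every
- difference fits in 16 bits. The names encode, build_table and check
- are invented for the example.
-

```python
import zlib


def encode(word):
    """Map a word to a 32 bit number (CRC-32 is a stand-in, not the module's hash)."""
    return zlib.crc32(word.encode("ascii"))


def build_table(words):
    """Sort the encoded values and store only the gaps between neighbours."""
    table, previous = [], 0
    for value in sorted({encode(w) for w in words}):
        table.append(value - previous)
        previous = value
    return table


def check(word, table):
    """Sum the differences; the word is accepted if the running total hits its code."""
    target, total = encode(word), 0
    for difference in table:
        total += difference
        if total == target:
            return True
        if total > target:      # passed the place where it would have been
            return False
    return False


table = build_table(["module", "syzygy", "dictionary", "table"])
print(check("syzygy", table))   # True: the word was in the list
```

- A word that was never in the list is rejected unless its code happens
- to collide with a listed word, which is the source of the small error
- rate discussed under "Bugs and/or Misfeatures".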
-
-
- Building Word Lists
- -------------------
-
- The word lists were created from a number of sources...
-
- * Take 1.5 Gbytes of Netnews.
- * Take the Brown University Corpus of English Usage.
- * Take the Unix online manual pages.
- * Take the RiscOS help text.
-
- Delete the non-alphabetics, sort and delete duplicates. From this you
- obtain a huge list of words plus a lot of junk. You pass this list
- through a standard spelling checker and then check the reject list for
- words that are useful but not accepted.
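-
- The mechanical part of that process (strip non-alphabetics, sort,
- delete duplicates) might look something like the following Python
- sketch. The function name build_word_list is invented for the example,
- and passing the result through a spelling checker and reviewing the
- rejects remains a manual step.
-

```python
import re


def build_word_list(text):
    """Split on anything that is not a letter, then de-duplicate and sort."""
    words = re.findall(r"[A-Za-z]+", text)
    return sorted(set(words))


sample = "The cat sat; the CAT sat again (twice!)."
for word in build_word_list(sample):
    print(word)
```

- Case is kept as found, so "The" and "the" survive as separate entries
- for later review.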
-
-
- Features
- --------
-
- * Installed as a module.
-
- * Large dictionary but small memory requirements. (To an old BBC B
- programmer the fact that I can even contemplate dedicating over 64K of
- memory to a utility is slightly obscene.)
-
- * General purpose. The spelling check is available to any program (or
- module) and not restricted to a particular application.
-
- * Multiple language support. British English and {work in progress}
- American English are supplied. The language to be used in a given
- run can be selected by command. The (changeable) default is British
- English.
-
- * Fairly fast. On an A5000 the current version could check about 500
- words/second when running a test program within a Task Window and
- about 620 wps running "native". (It is interesting to know that in the
- description of the algorithm in the book mentioned above the speed is
- described as about 170 wps on a VAX 11/750 - this was considered fast
- at the time the book was written! The VAX version was just under 64K
- in size - the dictionary was a bit smaller.)
-
-
- Bugs and/or Misfeatures
- -----------------------
-
- (It's not as bad as it looks. The module is only intended to
- implement basic spelling checks; clever preprocessing should be done
- by the application program and not set in stone within the module.)
-
- * Does not perform pre- or post-fix stripping.
-
- * Can't cope with many plurals (special case of the lack of post-fix
- stripping.)
-
- * Complains about what it believes to be incorrectly capitalised
- words. (Many "standard" capitalisations are encoded in the word list.)
-
- * Does not check single character words (i,a,...)
-
- * Currently not possible to supply a personal dictionary to be added
- to the standard pre-loaded dictionary.
-
- * The algorithm used can miss a small proportion of bad spellings.
- (About 1 in 1000 misspelled words will get through.) This is a result
- of the way that the words are encoded -- the error rate could be
- reduced at the cost of a larger word table but then the major
- advantages of a small module size (and thus speed) would be lost.
-
- * Anagram solver will be limited and quite slow.
-
- * Word finder is limited and quite slow. It operates by brute
- force, trying every possible word that fits the supplied partial
- word. Most of the possibilities are incorrect spellings, so it
- pushes the checking algorithm to its limits and thus reports more
- incorrect words than it should.
-
- * Difficult to use as a spelling corrector. The module cannot suggest
- close matches to a supplied word as the encoding algorithm generates
- unrelated hash values for similar words; in addition the original word
- list is not available to the module at run-time.
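-
- As an example of the kind of preprocessing an application could do
- for itself, here is a hedged Python sketch of simple plural handling
- layered on top of a word check. The is_known argument stands in for a
- call to the module (for instance via "WordCheck_Word"); the function
- names and the stripping rules are invented for illustration and are
- far from complete.
-

```python
def candidate_roots(word):
    """Yield the word itself, then roots made by removing simple plural endings."""
    yield word
    if word.endswith("ies") and len(word) > 4:
        yield word[:-3] + "y"       # entries -> entry
    if word.endswith("es") and len(word) > 3:
        yield word[:-2]             # boxes -> box
    if word.endswith("s") and len(word) > 2:
        yield word[:-1]             # modules -> module


def check_word(word, is_known):
    """Accept the word if it, or any of its candidate roots, is known."""
    return any(is_known(candidate) for candidate in candidate_roots(word))


# A small set stands in for the module's dictionary in this example.
known = {"module", "box", "entry"}
print(check_word("modules", known.__contains__))
```

- Every extra candidate offered to the checker is another chance for a
- false match, so an application should keep such rules conservative.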
-
-
- Configurable Bits
- -----------------
-
- * The default dictionary language used when the module is loaded can
- be changed by altering the "WordChk$DefLang" environment variable in
- both !Run and !Boot files.
-
- * New languages can be installed by adding the encoded dictionaries to
- the "Languages" sub-directory. They can then be specified as the
- default language or loaded with the *WordLoad command.
-
-
- Command Interface
- -----------------
-
- This allows a single word entered from the command line to be checked
- against the current dictionary.
-
- *WordCheck <word>
- ok/unknown
-
- {Work in progress} This treats the supplied word as an anagram and
- tries to rearrange it into words that are found in the current
- dictionary.
-
- *WordGram <word>
-
- This takes a word with missing characters (indicated by ?'s in the
- string) and tries to find matching words in the current dictionary.
-
- *WordFind <partial word>
-
- This loads a new language as the current dictionary. At the moment
- valid languages are "British" {work in progress, "American" and
- "Technical".}
-
- *WordLoad <language>
-
-
- Program Interface
- -----------------
-
- The module provides the following SWIs...
-
- "WordCheck_Word"
-
- Input
-
- R0 pointer to string to test (null byte terminated character
- string as generated by BASIC V or C)
-
- Output
-
- R0 preserved
- R1 returns boolean (-1/TRUE or 0/FALSE)
-
- BASIC Example
-
- SYS "WordCheck_Word","syzygy" TO ,valid%
-
- returns valid% = -1/TRUE (honest)
-
- whereas
-
- SYS "WordCheck_Word","pointer" TO ,valid%
-
- returns valid% = 0/FALSE (shame!)
-
-
- "WordCheck_Find"
-
- Input
-
- R0 pointer to string to test with up to three '?' characters indicating
- unknown characters (null byte terminated character string as
- generated by BASIC V or C). To obtain further possible matches use
- "WordCheck_FindNext".
-
- Output
-
- R0 preserved
- R1 returns first found word or null string if nothing found.
-
- BASIC Example
-
- SYS "WordCheck_Find","te?t" TO ,match$
-
- returns match$ = "teat"
-
-
- "WordCheck_FindNext"
-
- Output
-
- R1 returns next matching word or null string if nothing found.
-
- BASIC Example
-
- This assumes that "WordCheck_Find" has been called with an initial
- partial word of "te?t".
-
- SYS "WordCheck_FindNext" TO ,match$
-
- returns match$ = "tent"
-
- Further matches can be obtained by more calls to "WordCheck_FindNext"
- until the end of all possible matches is indicated by the return of a null
- string. For instance, in BASIC, to find all matches use code similar to...
-
- SYS "WordCheck_Find","te?t" TO ,m$
- WHILE m$ <> ""
- PRINT m$
- SYS "WordCheck_FindNext" TO ,m$
- ENDWHILE
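-
- The word finder is described earlier in this file as a brute force
- search. A rough Python model of that idea is shown below; it is a
- sketch, not the module's code. It assumes that only the lower case
- letters a to z are tried in each '?' position, and is_known stands in
- for the module's word check. The name find_matches is invented.
-

```python
from itertools import product
from string import ascii_lowercase


def find_matches(pattern, is_known):
    """Yield each known word made by filling every '?' with a letter a-z."""
    holes = [i for i, ch in enumerate(pattern) if ch == "?"]
    for letters in product(ascii_lowercase, repeat=len(holes)):
        candidate = list(pattern)
        for position, letter in zip(holes, letters):
            candidate[position] = letter
        word = "".join(candidate)
        if is_known(word):
            yield word


# A small set stands in for the module's dictionary in this example.
known = {"teat", "tent", "test", "text"}
matches = find_matches("te?t", known.__contains__)
print(next(matches, ""))    # plays the role of WordCheck_Find
print(next(matches, ""))    # plays the role of WordCheck_FindNext
```

- With three '?' characters there are 26 x 26 x 26 = 17576 candidates
- under this assumption. If roughly 1 in 1000 wrong words slips through
- the check, that would be on the order of 17 spurious matches, which is
- why the finder reports more incorrect words than it should.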
-
-
- "WordCheck_Load"
-
- Input
-
- R0 pointer to string holding language name to load (null byte
- terminated character string as generated by BASIC V or C.) The
- corresponding named language file must be present in the Languages
- sub-directory within !WordChk.
-
- Output
-
- R0 preserved
- R1 returns -1/TRUE if successful otherwise 0/FALSE.
-
- BASIC Example
-
- SYS "WordCheck_Load","British" TO ,ok%
-
- returns ok% = -1/TRUE if language "British" has been loaded.
- returns ok% = 0/FALSE if failed to load new language.
-
-
- Building New Dictionary Files
- -----------------------------
-
- [[[ NOT IN THIS VERSION ]]]
-
- A program, BuildDict, is provided which can create new encoded
- dictionary files from word lists. These files can then be loaded into
- the module. To create a new dictionary you need to do the following...
-
- * Gather a word list. There must be at least 256 words in the list
- and there will probably have to be many more words in order that
- the difference between the hash values is always < 64K. There
- should not be more than 33000 words in the list.
-
- * Delete single character words and ensure that there are no
- leading or trailing spaces or tabs around the words.
- There should only be one word per line.
-
- * Sort the list (not essential for BuildDict but needed for
- following step.)
-
- * Delete duplicate words.
-
- * Place the file in the WordLists sub-directory of !WordChk.
-
- * Run the BuildDict program. This will create, if successful, an
- encoded dictionary file in the Languages sub-directory. There are
- a number of possible fatal errors that may occur during
- processing.
-
- * Change the WordChk$DefLang value set in !Boot and !Run to make
- your new language the default or use the *WordLoad command to
- load the new language into a running module.
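-
- The requirements on the word list can be checked before running
- BuildDict. The following Python sketch tests only the rules listed
- above (size limits, one clean word per line, no single characters,
- sorted, no duplicates). It knows nothing about the hash, so it cannot
- tell whether the differences will be small enough. The function name
- validate_word_list is invented, and lines are assumed to have had
- their newline characters removed already.
-

```python
def validate_word_list(lines, minimum=256, maximum=33000):
    """Return a list of problems; an empty list means the word list looks usable."""
    problems = []
    if len(lines) < minimum:
        problems.append("fewer than %d words" % minimum)
    if len(lines) > maximum:
        problems.append("more than %d words" % maximum)
    for number, line in enumerate(lines, start=1):
        if line != line.strip() or " " in line or "\t" in line:
            problems.append("line %d: stray spaces or tabs" % number)
        elif len(line) < 2:
            problems.append("line %d: single character or empty" % number)
    if lines != sorted(lines):
        problems.append("list is not sorted")
    if len(set(lines)) != len(lines):
        problems.append("list contains duplicates")
    return problems


# Deliberately bad input: too short, unsorted, duplicated, trailing space.
for problem in validate_word_list(["b", "a ", "a "]):
    print(problem)
```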
-
- The hash algorithm has been optimised for UK/US English. It may not
- be suitable for other languages. A future version of !WordChk may
- include a means to alter and re-optimise the hash algorithm if
- necessary for each language to be loaded.
-
- Foreign Languages
- -----------------
-
- True spelling checkers for foreign languages are complicated by the
- fact that most of them care about the 'sex' (grammatical gender) of
- the words. Some of them are so regular that native writers rarely make
- spelling errors other than simple 'typing' errors, i.e. transposition
- of characters. Some languages insist on strange characters that do not
- appear on the keyboard. The Arc copes quite well with the strange
- characters for languages such as French and German. Languages such as
- Esperanto are not so well provided for as the accents appear on
- unexpected characters and special provision would have to be made for
- defining them.
-
- In any case WordChk is just a word checker and not a full spelling
- checker (the difference is that one just tries to match a word to one
- in a list, the other attempts to manipulate the word in various ways
- to attempt to find the root.)
-
- British {word list being repaired}
- American {word list being repaired}
- Computing {work in progress}
- Hacking {work in progress}
- French need smaller word list!
- German need word list.
- Italian need word list.
- Esperanto can't display accented characters from default font.
- Latin need word list (getting a bit weird here?)
-
-
-
- =====================================================================
-
- The legal bit: I don't care what kind of ...ware it is called but I
- retain copyright on the code and encoded dictionary table used in this
- particular RiscOS implementation of the spelling checker algorithm.
- You can distribute version 0.03 of the WordCheck module and
- associated files as far and wide as you wish so long as this README
- file is also distributed with the module and hash file. You may
- include the !WordChk application (or just the language, Wordchk
- module and README files) within another application that makes use of
- its facilities. If you paid money (other than a small amount for disc
- duplication and postage) for these files then you've been ripped off.
-
- As noted above, this code is in alpha test and you take your own
- chances with bugs, spelling errors etc.
-
- ======================================================================
-